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Abstract 

Regime switching volatility models provide a tractable method of modelling stochas- 
tic volatility. Currently the most popular method of regime switching calibration is the 
Hamilton filter. We propose using the Baum- Welch algorithm, an established technique 
from Engineering, to calibrate regime switching models instead. We demonstrate the 
Baum- Welch algorithm and discuss the significant advantages that it provides com- 
pared to the Hamilton filter. We provide computational results of calibrating the 
Baum- Welch filter to S&P 500 data and validate its performance in and out of sample. 

Key words: Regime switching, stochastic volatility, calibration, Hamilton filter, Baum- 
Welch. 



1. Introduction and Outline 

Regime switching (also known as hidden Markov models (HMM)) volatility models 
provide a tractable method of modelling stochastic volatility. Currently the most pop- 
ular method of regime switching calibration is the Hamilton filter. However, regime 
switching calibration has been tackled in engineering (particularly for speech process- 
ing) for some time using the Baum- Welch algorithm (BW), where it is the most popular 
and standar d method of HM M calibration. A review of the Baum- Welch algorithm can 



be found in Lev05| . JR91I ]. The BW algorithm is inc reasingly being applied beyond 



engineering applications (for instance in bioinformatics BEDE041 ]) but has been hardly 
applied to financial modelling, especially to regime switching stochastic volatility mod- 
els. 
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Unlike the Hamilton filter, the BW algorithm is capable of determining the entire 
set of HMM parameters from a sequence of observation data. Furthermore, BW is a 
complete estimation method since it also provides the required optimisation method 
to determine the parameters by MLE. 

The outline of the paper is as follows. Firstly, we introduce regime switching volatil- 
ity models and the Hamilton filter. In the next section we introduce the Baum- Welch 
method, describing the algorithm even for multivariate Gaussian mixture observations. 
We then conduct a numerical experiment to verify the Baum- Welch method's to detect 
regimes for the S&P 500 index. We finally end with a conclusion. 



2. Regime Switching Volatility Model and Calibration 

2.1. Regime Switching Volatility 

Wiener process driven stochastic volatility models capture price and volatility dy- 
namics more successfully compared to previous volatility models. Specifically, such 
models successfully capture the short term volatility dynamics. However, for longer 
term dynamics and fundamental economic changes (e.g. "credit crunch"), no mecha- 
nism existed to address the change in volatility dynamics and it has been empirically 
shown t hat volatility is related to long term and fundamental conditions. Bekaert in 

BHL06| claims that volatility changes are caused by economic reforms, for example 
on Black Wednesday the pound sterling was withdrawn from the ERM (European 

xchange Rate M echanis m) , causing a sudden change in value of the pound sterling 



BRQ2J. Schwert Sch89l | empirically shows that volatility increases during financial 



crises. 



A class of models that address fundamental and long term volatility modell i ng; is the 
regime switching model (or hi dden M arkov model) e.g. as discusssed in TimOcJ ] . EvdH97 ] 
In fact, Schwert suggests in |Sch89j ] that volatility changes during the Great Depres- 
sion ca n be acc ounted for by a regime change such as in Hamilton's regime switching 



model 



Ham89| . Regime switching is considered a tractable method of modelli ng price 



dynamics and does not violate Fama's "Efficient Market Hypothesis" Fam65l| . which 
claims that price processes must follow a Markov process. Hamilton [Ham89| was the 
first to introduce regime switching models, which was applied to specifically model 
fundamental economic changes. 

For regime switching models, generally the return distribution rather than the con- 
tinuous time p rocess i s specified. A typical example of a regime switching model is 
Hardy's model Har0lj |: 



log((X(t + 1)/X(t))\i) ~ Af(v,i, <pi),ie {1, .., R}, 



2 



where 



• (fi and Ui are constant for the duration of the regime; 

• i denotes the current regime (also called the Markov state or hidden Markov 
state); 

• R denotes the total number of regimes; 

• a transition matrix A is specified. 

For Hardy's model the regime changes discretely in monthly time steps but stochasti- 
cally, according to a Markov process. 

Due to the ability of regime switching models to capture long term and fundamental 
changes, regime switching models are primarily focussed on modelling the long term 
behaviour, rather than the continuous time dynamics. Therefore regime switching 
models switch regimes over time periods of months, rather than switching in continuous 
time. Examples of regime swit ching mo dels that model dynamics over shorter time 
periods are Valls-Pereira et al. VPHS04I|. w ho propose a regime switching GARCH 
process, while Hamilton and Susmel HS94J ] give a regime switching ARCH process. 
Note that economic variables other than stock returns, such as inflation, can also be 
modelled using regime switching models. 

Regi me swit ching has been developed by various researchers. For example, Kim 
and Yoo KY95I ] de velop a multivariate regime switching model for coincident economic 



indicators. Honda 



Hon03| determines the optimal portfolio choice in terms of utility 



for assets following GB M bu t with continuous time regime switching mean returns. 



Alexander and Kaeck 



AK08 



apply regime switching to credit default swap spreads, 



Durland and McCurdy DM941 ] propose a model with a transition matrix that specifies 
state durations. 

The theory of Markov models (MM) and Hidden Markov models (HMM) are meth- 
ods of mathematically modelling time varying dynamics of certain statistical processes, 
requiring a weak set of assumptions yet allow us to deduce a significant number of prop- 
erties. MM and HMM model a stochastic process (or any system) as a set of states 
with each state possessing a set of signals o r observ ations. The model s have been used 
in d iverse applications such as econ omics SSS02], queuing theor y [SF06I] . engineer- 



ing 



Markov model: 



TGOlj and biological modelling jMGPGoj . Following Taylor [TK84I ] we define a 



Definition 1. A Markov model is a stochastic process X(t) with a countable set of 
states and possesses the Markov property: 



p(q, 



t+i = J | qi,q 2 ,-,qt 



p(q 



't+i = J I Qt 



(2) 
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where 

• q t is the Markov state (or regime) at time t of X(t); 

• % and j are specific Markov states. 

As time passes the process may remain or change to another state (known as state 
transition). The state transition probability matrix (also known as the transition kernel 
or stochastic matrix) A, with elements o^, tells us the probability of the process 
changing to state j given that we are now in state i, that is = p(qt+i = j \ Qt — i)- 
Note that is subject to the standard probability constraints: 

< dij < l,Vi,j, (3) 

oo 

E^' = !' Vi ( 4 ) 

j'=i 

We assume that all probabilities are stationary in time. From the definition of a MM 
the following proposition follows: 

Proposition 1. A Markov model is completely defined once the following parameters 
are known: 

• R ; the total number of regimes or (hidden) states; 

• state transition probability matrix A of size RxR. Each element is = p(qt+i = 
j\Qt = i), where i refers to the matrix row number and j to the column number of 
A; 

• initial (t=l) state probabilities 7Tj = p(qi = i),Wi. 

A hidden Markov model is simply a Markov model where we assume (as a modeller) 
we do not observe the Markov states. Instead of observing the Markov states (as 
in standard Markov models) we detect observations or time series data where each 
observation is assumed to be a function of the hidden Markov state, thus enabling 
statistical inferences about the HMM. Note that in a HMM it is the states which must 
be governed by a Markov process, not the observations and throughout the thesis we 
will assume one observation occurs after one state transition. 

Proposition 2. A hidden Markov model is fully defined when the parameter set {A, B, n} 
are known: 

• R 7 the total number of (hidden) states or regimes; 

• A, the (hidden) state transition matrix of size RxR. Each element is = 
p(Qt+i =j\qt =i); 

• initial (t=l) state probabilities iii = p(qi = i),Wi; 

• B, the observation matrix, where each entry is bj(O t ) = p(O t \j) for observation 
O t . For bj(O t ) is typically defined to follow some continuous distribution e.g. 
bj(O t ) ~M(uj,ipj). 
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2.2. Current Calibration Method: Hamilton Filter 

In financial mathematics or economic literature the standard calibration method 



for regime switching models is the Hamilton filter [Ham89l ] , which works by maximum 
likelihood estimation (MLE). MLE is a method of estimating a set of parameters of a 
statistical model (0) given some time series or empirical observations Oi,0 2 , ...,Ot- 
MLE determines by firstly determining the likelihood function £(0), then maximis- 
ing £(0) by varying through a search or an optimisation method. 

A statistical model with known parameter values can determine the probability 
of an observation sequence O = 0\0 2 ...Ot- MLE does the opposite; we numerically 
maximise the parameter values of our model such that we maximise the probability 
of the observation sequence O = 0\0 2 ...Ot- To achieve this the MLE method makes 
two assumptions: 

1. In maximising £(0) the local optimum is also the global optimum (although this 
is generally not true in reality) . The opti mal valu es for are in a search space of 



the same dimensions as 0. Hamilton in Ham94j gives a survey of various MLE 
maximisation techniques such as the Newton-Raphson method; 
2. The observations Oj, Ot are statistically independent. Note that for Markov 

models we assume the conditional observations 
(0 t |0 t _i), (O t -i|O t - 2 ), (0 t - 2 |0t-3)"- are independent. 

For a regime switching process the general likelihood function £(0) is: 

£(0) = /(O 1 |0)/(O 2 |0,O 1 )/(O 3 |0,O 1 ,O 2 ) 
••■ f(0 T \Q,0 1 ,0 2 ,...,0 T ^ 1 ), 

where /(0n|0) is the probability of On, given model parameters 0. Now by properties 
of logarithms we have: 

%(£(©)) = log(f(0 1 \Q)) + log(f(0 2 \Q,0 1 )) + --- (5) 
+ M/(Ot|0,Oi,O 2 ,-,Ot-i)). (6) 

Hamilton proposes a likelihood function for regime switching models, the Hamilton fil- 
ter. As an example, if we assume we have a two regime model with each regime having a 
lognormal return distribution, we wish to determine parameters = {ui, u 2 , (fix, tp 2 , a± 2 , a 2 i}. 
Note that in this simple HMM a 22 = 1 — a\ 2 and an = 1 ~ Q21 therefore we do not 
need to estimate a 22 , an in 0. 
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To obtain f(O t \Q) in equation (jBJ) for t > 1, Hamilton showed it could be calculated 
by a recursive filter. We observe the relation: 

2 2 

f{o t \e, o u o 2 , o t _o = £ £ /( gt , qt _ Xi o t \e, o u o t ^). (7) 

qt = l gt-i=l 

Now using the relation^ 

p(o,g|e)=p(o|e,g)p(o|e), (8) 

where Q = q\q 2 ... represents some arbitrary state sequence, we make the substitution 
/(g t ,g t _ 1 ,CMe,0 1 ,...,O t _ 1 ) 

= K?t-i|e,o 1 ,...,o t _ 1 ) xpfei^.e) x /(o t | gt ,e). (9) 

Therefore 

f(O t \Q,0 1 ,0 2 ,...,O t ^ 1 ) 

2 2 

= E E P(9t-i|©,0 1 ,..,0 t _ 1 )x j )(g t | gt _ lj e)x/(0 t |g t ,e), (10) 

<?t=l gt-i=l 

where 

• p(<Zi|<Zi-i> ©) — P(?t = i|?t-i = *> ©) represents the transition probability we 
wish to estimate; 

• f{Ot\qt — h @) = PiiPt) where pj(-) ~ M{u^ (pi) the Gaussian probability density 
function for state z, whose parameters Wj, pi we wish to estimate. 

The parameters 6 = {ui, u 2 , <fi, (p 2 , 012, ^21} are obtained by maximising the likelihood 
function using some chosen search method. 

To calculate f(O t \Q, Oi, 2 , O t -i) we require the probability 
p(qt-i\Q, 0±, 2 , Ot-i) in equation ( TTOT) (summed over two different values of q t -\ in 
the summations in equation ( flOl) ). This can be achieved through recursion, that is the 
probability p(q t -i\Q, 1; .., O t -i) can be obtained from p(q t _ 2 \Q, Oi, .., O t - 2 ): 

( in n n n \ E=i f(qt-u Qt~2 = h gt-ijg, d, -, O t _ 2 )) 
p(gt_i|e,Oi,0 2 ,...,O t _i) = ~fl~F^) — TTTi n — S ' ( n ) 



The denominator of equation f TTTT) is obtained from the previous period of 

f(O t \Q, Oi,0 2 , ...,0t-i) ( in other words /(0 t _i|6, Oi, 2 , O t _ 2 )) so by inspecting 



1 Note: following discussions with Prof.Rabiner [Rab08| on the equation forp(0,Q|8) it was con- 
cluded that the equation for p(0,Q\Q) in Rabiner's paper [Rab89l | is incorrect. 
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equation ( fT0l) we can see it is a function of p(g t __2|@, 0\, ...,O t -2). The numerator of 
equation ffTTT) is obtained from calculating equation © for the previous time period, 
which is also a function of p(g t _ 2 1 @, 0\, 0(_ 2 ). 

To start the recursion of equation ( TTTT) at p(<?i = z|Oi,0) we require f(Oi\Q). 
Hamilton assumes the Markov chain has been running sufficiently long enough so that 
we can make the following assumption about our observations 0\,0 2 , ...,Ot- Techni- 
cally, Hamilton assumes the observations 0\, 2 , Ot are all drawn from the Markov 
chain's invariant distribution. If a Markov chain has been running for a sufficiently 
long time, the following property of Markov chains can be applied: 

T]j = \imp(q t =j\q 1 =i), Vz, j = 1, 2, .., R, (12) 

t— >oo 

R 

where ^^Vj = ^-iVj > 0- (13) 
i=i 

The probability rjj tells us in the long run (t — > 00) the (unconditional) probability of 
being in state j and this probability is independent of the initial state (at time t=l). 
An important interpretation of rjj is as the fraction of time spent in state j in the long 
run. Therefore the probability of state j is simply rjj and so: 

/(Oil©) = /(<& = l,Oi|e) + /(gi = 2,Oi|e), (14) 
where /(gi = i,0i|6) = rjiP^Ox). (15) 

We can therefore calculate p(qi = i\Oi, 0): 

l \n Ik 1 = j »°il e ) (-\ a\ 

p{q 1 = i\o u Q) = — Jajje) — ' ( 6 > 

ViPi(Oi) , 17 , 



^iPi(Oi) +r] 2 p 2 {Oi)' 
Furthermore it can be proved for a two state HMM that: 

Vi = cW(ai2 + a 2 i), 
772 = 1-771- 

Therefore p(qi = z|Oi,©) can be obtained from estimating the parameter set = 
{ui, u 2 , ipi, v?2, O12, 021}, which is obtained by a chosen search method. 

The advantages of Hamilton's filter method are firstly we do not need to specify 
or determine the initial probabilities, therefore there are fewer parameters to estimate 
(compared to the alternative Baum- Welch method). Therefore the MLE parameter 
optimisation will be over a lower dimension search space. Secondly, the MLE equation 
is simpler to understand and so easier to implement compared to other calibration 
methods. 
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3. Baum- Welch Algorithm 

The Baum- Welch (BW) is a complete estimation method since it also provides 
the required optimisation method to determine the parameters by MLE. We will now 
explain the BW algorithm and to do so we must first explain the forward algorithm, 
which we will do now. 

3.1. Forward Algorithm 

The forward algorithm calculates p(0\M), the probability of a fixed or observed 
sequence O—OiO^-.-Ot, given all the HMM parameters denoted by M = {A,B,7r}. 
We recall from the definition of HMM that the probability of each observation p(O t ) 
will change depending on the state at time t (qt). Hence the most straightforward way 
to calculate p(0\M) is: 

p(0\M) = p(°>Q\ M )> ( 18 ) 

all Q 

= P{0\M,Q).p(Q\M), (19) 

all Q 

■ O qiq2 q2 

(02).. 

■ ■CLq T _ 1 q T bq T (0 T ), (20) 

all Q 

where p(0\M,Q) = b^O^.b^) b qT (0 T ). (21) 

Here "all Q" means all possible state sequences q\qi---qT that could account for obser- 
vation sequence O, 6(.)(0(.)) is defined in proposition [21 p(0\M, Q) is the probability of 
the observed sequence O, given it is along one single state sequence Q = qiq2---qr and 
for HMM M. We must sum equation ( 1201) over all possible Q state sequences, requiring 
R T computations and so this is computationally infeasible even for small R and T. 

To overcome the computational difficulty of calculating p(0\M) in equation (I2"U]) 
we apply the forward algorithm, which uses recursion (dynamic programming). The 
forward algorithm only requires computations of the order R 2 T and so is significantly 
faster than calculating equation fl2"U]) for large R and T. 

Let us define the forward variable Kt(i). 

K t (t)=p(0 1 2 ...O u q t = i\M). (22) 

Given the HMM M, «t(i) is the probability of the joint observation upto time t of 
0\02---O t and the state at time t is i i.e. q t = i. If we can determine ht{i) we can 
calculate p{0\M) since: 

R 

p{0\M) = Y,*T{i). (23) 

i=l 
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Now Kt+i(j) can be expressed in terms of Kt(i), therefore we can calculate n t+ i(j) 
by recursion: 



«t+iO') 



E 

i=l 



K t [i)a. 



ij 



bj(0 



t+i, 



1< t < T - 1. 



(24) 



The variable K t +i{j) in equation ( 1241 can be understood as follows: K t (j)a,ij is the 
probability of the joint event 0\....O t is observed, the state at time t is i and state j is 
reached at time t+1. If we sum this probability over all R possible states for i, we get 
the probability of j at t+1 accompanied with all previous observations from 0\0<i---O t 
only. Thus to get n t +i{j) we must multiply by bj{O t +i) so that we have all observations 
Ox...O t+1 . 

Therefore the recursive algorithm is as follows: 
1. Initialisation: 



Ki(i) = iTibi(Oi), 1 < i < R. 



(25) 



2. Recursion: 

«t+i(j) 

3. Termination: t+l=T. 

4. Final Output: 



' R 



i=l 



bj(0 



1< t < T-l. 



(26) 



R 



P (0\M) = J2^t( 



(27) 



8=1 



At t=l no sequence exists but we initialise the recursion with 7Tj to determine Ki(i). 

3.2. Baum- Welch Algorithm 

Having explained the forward algorithm we can now explain the BW algorithm. 
Using observation sequence O, the BW algorithm iteratively calculates the HMM pa- 
rameters M = {A,B,7r}. Specifically, BW estimates M = {a^-, bj(-), vrj}Vi, j, denoted 
respectively by M = {a^-, bj{-), tT;}, such that it maximises the likelihood of p(0\M). 
No method of analytically finding the globally optimal M exists. Howeve r it has been 
theoretically proven BW is guaranteed to find the local optimum 

Let us define ip t (hj)'- 



Rab89|. 



MhJ) = P(Qt = i,Qt+i =3 \o,M). 



(2? 
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The variable ipt{hj) is the probability of being in state i at time t and state j at time 
t+1, given the HMM parameters M and the observed observation sequence 0. We can 
re-express ipt(i,j) as: 

Mhj) = p(qt = i,qt+i=j\0,M), (29) 
p(q t = i, qt+i =j,Q\ M) 

p(0\M) • W 

Now we can re-express equation ( !30l) using the forward variable 

K t(i) = p(Oi02-..Ot,qt = i\M) and using analogously the so called backward variable 
Qt+iij)- 

Qt{i) = p(O t+1 O t+2 ...0 T \q t = i, M), (31) 
so that g t+ i(i) = p(O t+ 20t + 3...0 T \qt+i = i, Af). (32) 

The backward variable Qt{i) is the probability of the partial observed observation se- 
quence from time t+1 to the end T, given M and the state at time t is i. It is calculated 
in a sim ilar recursive method to the forward variable using the backward algorithm (see 



Rab89| for more details). Hence we can rewrite ipt{hj) as 



Mh3) ~ pj0\M) • (33) 

We can also rewrite the denominator p(0\M) in terms of the forward and backward 
variables, so that ipt(i,j) is entirely expressed in terms of K t (i), Oy, bj(O t+ i), Qt+i(j)'- 

R R 

p(0\M) = J2J2 K ^MOt + i)Qt + iU)- (34) 



Now let us define TAi): 



r*(i) = p{q t = i\0,M), (35) 

R 



E 

j'=i 



^Mhj)- (36) 



Equation fl36l) can be understood from the definition of ?p t (i,j) in equation (T29]) ; sum- 
ming ipt(i,j) over all j must give p(q t = i\0, M), the probability in state i at time t, 
given the observation sequence O and model M. Now if we sum T t (i) from t=l to T-l 
it gives us T(i), the expected number of transitions made from state i: 

T-l 

T(z) = J2m- (37) 
t=l 
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If we sum T t (i) from t=l to T it gives us the expected number of times state i is 
visited: 



r 



t=i 

We are now in a position to estimate M. The variable is estimated as the 
expected number of transitions from state i to state j divided by the expected number 
of transitions from state i: 

- aij = S» *<"•> ■ (39) 
The variable 7fj is estimated as the expected number of times in state i at time t=l: 

7ri = r x (i). (40) 

The variable bj(s) is estimated as the expected number of times in state j and observing 
a particular signal s, divided by the expected number of times in state j: 

m = ^T' (41) 

where T t (j)' is T t (j) with condition O t = s. 
We can now describe our BW algorithm: 

1. Initialisation: 

Input initial values of M (otherwise randomly initialise) and calculate p(0\M) 
using the forward algorithm. 

2. Estimate new values of M: 
Iterate until convergence: 

(a) Using current M calculate variables 7tt(i), £? t+1 (j) by the forward and back- 
ward algorithm and then calculate ip t (hj) as i n equation ( |33l) . 

(b) Using calculated i[) t (i,j) in (a) determine new estimates of M using equa- 
tions (|36|)-(j4T|). 

(c) Calculate p(0\M) with new M values using the forward algorithm. 

(d) Goto step 3 if two consecutive calculations of p(0\M) are equal (or converge 
within a specified range). Otherwise repeat iterations: goto (a). 

3. Output M. 
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The BW algorithm is started with initial estimates of M — (A, B, W). These estimates 
in turn are used to calculate the right hand side of equations (1551). (j4Til) and (j4*Tj) to 
give the next new estimate of M — (A, B, W). We consider the new estimate M n to be 
a better estimate than the previous estimate M p , if p(0\M n ) > p(0\M p ), with both 
probabilities calculated via the forward algorithm. In other words, we prefer the M 
that increases the probability of observation O occurring. 

If p(0\M n ) > p(0\M p ) then the iterative calculation is repeated with M n as the 
input. Note that at the end of step two, if the algorithm re-iterates then inputting the 
new M at step 2a means we will get a new set of M after executing 2b. The iteration 
is stopped when p{0\M n ) = p(0\M p ) or is arbitrarily close enough and at this point 
the BW algorithm finishes. 

Since the BW algorithm has been proven to always converge to the local optimum, 
the BW will output the local optimum. We also note that correct choice of R is 
important since p(0\M) changes as M changes for a fixed O, however this disadvantage 
is common to all MLE methods. 

3.3. Multivariate Gaussian Mixture Baum- Welch Calibration 

To account for the variety of empirical distributions possible for various assets and 
capturing asymmetric properties arising from volatility (such as fat tails), we model 
each regime's distribution by a two component multivariate Gaussian mixture (GM), 
which is a mixture of two multinormal distributions. 

Definition 2. (Multinormal Distribution) Let X = (Xi, X2---,X n ) be an n- dimensional 
random vector where each dimension is a random variable. Let u=(ui,U2---,u n ) repre- 
sent an n dimensional vector of means, X represent an n x n covariance matrix. We 
say X follows a multinormal distribution if 

X~A/;( W ,£), (42) 

which may be alternatively written as 

l Xl \ I ( Ul \ <f - Vln\\ 

V. ~Af n u. , y?... if... if... . 

\X n ) \\U n J \ipnl V... Vnn) J 

The probability of X is 

p(x) ^wiPy rap H (x - u)TE " (x - u) )' 

where det(S) denotes the determinant o/S. 



12 



(43) 



(44) 



Definition 3. (Multivariate Gaussian Mixture) A multivariate Gaussian mixture con- 
sists of a mixture of K multinormal distributions, spanning n- dimensions. It is defined 
by: 

X ~ ciJV n (ui, Si) + ... + c K N n (u K , Ejc), (45) 
where Ck are weights and 

k=K 

^c fc = l,c fc >0. (46) 
k=i 

The term p gmm (X) denotes the probability of a multivariate Gaussian mixture variable 
X and is defined as 

K 

p gmm (X) = ^c fe p fc (X), (47) 

k=l 

where p fc (X) ~ Af n (u fe , £ fc ). 

If we model a stochastic process X by a Gaussian mixture for each regime then for 
a given regime j we have: 

x ~ c jl tf a (u jl ,i: jl ) + ... + c jK N' I1 (u jK ,i: jK ). (48) 

The probability of X for a given regime j, p gmm (X)j, is: 

k=K 

PgmmPQj = ^C jk p jk (X). (49) 
k=l 

where 

• Cjfc are weights for each regime j and 



$>,fc = l,Vj. (50) 
fc=i 

Note that the dimensions of multivariate distribution n are independent of the number 
of mixture components K. 

For an n-asset portfolio X(t) = (Xi(t), X2(t), X n (t)), where Xi(t) represents 
the stock price of asset i, with each asset following a Gaussian mixture, the portfolio 
returns would be modelled by: 

dX/X ~ CjijV n (uji, Sji) + c j2 Af n (u j2 , S j2 )- (51) 
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For practial calibration purposes we set the multivariate observation vector O t to 
annual log returns: 



O t = log(X(t + At)/X(t)), 



(52) 



where At—1 year. 

Combining GM with HMM gives us a GM-HMM (Gaussian mixture HMM) model 
and the BW algorithm can be adapted to it: Gaussian mixture BW (GM-BW). For 
O t our observation (vector) at time t we model bj(0) by GM: 



6,(0) 



(53) 



The BW algorithm for calculating A, 7Tj remains the same; for B we have a GM. We 
would like to obtain the GM mixture coeffficents Cjk, mean vectors Uj^ and covari- 
ance matrices Tijk whose estimates are Cjk, Ujk and Xjfc resp ectively . These can be 
incorporated within the BW algorithm as detailed by Rabiner Rab89j |: 



Ef=i T t(j, k).(O t - Ujk)(Ot - u jk ) 

Ef=iEf=i r <U fc )' 



3 k 



where T t (j, k) 



CjkPjkjOt) 

Pgmm ( O 



t)j 



(54) 
(55) 
(56) 
(57) 



Here T t (j, k) is the probability at time t of being in state j with the k mixture component 
accounting for 0(. Using the same logic as in section [3] (for non mixture distributions) 
we can understand equations fl54|) - fl56|) . for example Cjk is the expected number of times 
the HMM k-th component is in state j divided by the expected number of times in state 

j- 

It is worth noting that a well known problem in maximum likelihood estimation of 
GM is that observations with low variances give extrem ely high likelihoods, in which 



case the likelihood function does not converge 



in the univariate case Messina and Toscani MT071 ] implement Ridolfi's and Idier's 



M' 



071 | . To overcome this problem 



RI02| penalised maximum likeliho od fun ction, which limits the likelihood value of 



observations. This is beneficial in 



MT07| ] because the observation time scales are of 



the order of days and therefore the variance of samples may approach zero. For our 
applications we calibrate the GM-HMM to annual return data, therefore the samples 
are unlikely to approach variances anywhere near zero. 
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3.4- Advantages of Baum- Welch Calibration 

The BW algorithm has significant advantages over the Hamilton filter. Firstly, the 
Hamilton filter requires observation data to be taken from the invariant distribution in 
order to estimate the parameters (see equation f)12p ). To obtain observations from the 
invariant distribution implies the number of state transitions approaches a large limit, 
so is not suited to Markov chains that have run for a short time. Furthermore, the 
time to reach the invariant distribution increases with the number of regimes R and 
the number of Gaussian mixtu res K. 

Psaradakis and Sola PS9q investigated the finite sample properties of the Hamilton 
filter for financial data. They concluded that samples of at least 400 observations are 
required for a simple two state regime switching model where each state's observation 
is modelled by a normal distribution. 

Secondly, the Hamilton filter has no method of estimating the initial state probabil- 
ities whereas the BW is able to take account of and estimate initial state probabilities. 
This has a number of important consequences: 

1. BW does not require observations from the invariant distribution and so can be 
calibrated to data of any observation length. 

2. the Hamilton filter cannot fully define the entire HMM model since the initial 
state probabilities are one of the key HMM parameters in the definition (see 
HMM definition in section [5]) . 

3. we cannot determine the probability of observation sequences p(0\M), since we 
require the initial state probabilities. This can be understood from the forward 
algorithm. 

4. we cannot determine the most likely state sequence that accounts for a given ob- 
servation sequence and HMM, which can be obtained by the Viterbi algorithm. 
The Viterbi algorithm tells us the most likely state sequen ce for a given observa- 
tion sequence and HMM parameters M (see Forney FJ73| for more information). 

5. without the initial state probabilities, we cannot simulate state sequences since 
the initial state radically alters the state sequence and its influence on the state 
sequence increases as the sequence size decreases. Consequently we cannot vali- 
date a model's feasibility by simulation. 

Note that BW estimates initial state probabilities independently of the transition prob- 
abilities, whereas in the Hamilton filter rji is a function of estimated transition proba- 
bilities. Hence BW is able to independently estimate more HMM parameters than the 
Hamilton filter. 
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Thirdly, to our knowledge the Hamilton filter cannot be applied to multivariate dis- 
tributions, nor more complicated univariate distributions than Gaussians. Particularly 
for financial applications, we use multivariate data to model portfolios and multivariate 
stoch astic vola tility is becoming an increasingly important research area (see Bauwens 



et al. BLR06| for a survey on multivariate GARCH). Hamilton has proposed a calibra- 



tion met hod for univariate mixture distributions, the Quasi-Bayesian MLE approach 



Ham9lj . yet this requires some prior knowledge regarding the reliability of observa- 
tions. The GM-BW calibrates a multivariate Gaussian mixture to multivariate data, 
thereby capable of modelling most empirically observed distributions. 

Fourthly, the GM-BW can calibrate time varying correlations. It is known that 
correlations a mongst random variables tend to be unstable with time; for example 



Buckley et al. |BSS07I | give evidence of covariances varying with time and model them 
as regime dependent. The GM-BW algorithm gives the covariance matrix for each 
regime and each regime is postulated to be linked to an economic state. Therefore, we 
can model and extract information on changing correlations with changing economic 
conditions. For instance, some stocks are considered to be strongly correlated with the 
economic cycle (known as cyclical stocks) e.g. British Airways, whereas other stocks 
are considered independent of the economy (known as defensive stocks) e.g. Tesco. 

Finally, the BW algorithm is a complete HMM estimation method whereas the 
Hamilton filter is not. Hamilton's method provides no method or guidance as to the 
optimisation algorithm to apply for finding the parameters from the non-linear filter, 
yet the solutions can be significantly influenced by the non-convex optimisation method 
applied. The BW algorithm includes an estimation method for the full HMM and a 
numerical optimisation scheme. Additionally, the BW method is guaranteed to find 
the local optimum. 

4. Numerical Experiment: Baum- Welch GM-HMM Calibration Results 

In this section we calibrate a 2-state regime switching model with 2 Gaussian com- 
ponents to S&P 500 data from 1976-1996. We fitted the model: 

dX/X ~ CjiAfiiujt, ifjt) + c j2 Afi(u j2 , tp j2 ),j e {1, 2}. (58) 

We set our observation to annual log returns: 

O t = log(X(t + At)/X(t)), (59) 

where At=l year and X(t) is the stock price. 
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4-1. Procedure 

Due to GM-BW's wide usage in engineering it has already been implem ented by 



numerous authors. We chose K. Murphy's Matlab implementation [Mur08j ] because 
it is considered one of the most standard and cited GM-BW programs. It also offers 
many useful features that are unavailable on other implementations e.g. the Viterbi 
algorithm for obtaining state sequences. 

The GM-BW algorithm finds the best GM-HMM parameters that maximise the 
likelihood of the observations, however, this involves searching a nonconvex search space 
and BW only finds the local optimum. Theoretically, the globally optimal parameters 
can be determined by initialising the GM-BW algorithm over every possible starting 
point, then the globally optimal parameters are those that give the highest likelihood. 
However, the GM-BW algorithm finds the locally optimal M = (A, B,7f) (where B is 
parameterised by 'Cj^ 1 Tp- k and Ujk), so that the calibration problem has a nonconvex 
solution search space over thirteen dimensions. 

Due to the high dimensionality of the parameter estimation problem for either 
the univariate or multivariate case, determining the optimal parameters by initialising 
through different starting points is impractical. Instead we concentrated our effort on 
finding good initial parameter estimates for M. This was to significantly increase the 
probability of finding the best GM-B W solution s, particularly as initialisation strongly 



influences the GM-BW optimisation |LDLK04 |. 



W Initialisation 

To initialise W, we assign a probability of 0.5 to state one if the first observed return is 
positive, with the probability increasing in value the more positive it is. 

A Initialisation 

Given the HMM structure was chosen due to its ability to capture long term and 
fundamental properties, we can initialise A based on the long term and fundamental 
properties we expect it to possess: 

- fo.6 0A\ , 
A = . 60 

\0.7 0.3/ 

Each state represents an economic regime so A can be interpreted as follows. The 
economy in the long term follows an upward drift, so we would expect it is more likely 
the HMM remains in state one rather than goto state two, given it is already in state 
one -hence we assign probability 0.6. Similarly, if the HMM is in state two we would 



17 



expect it is far more likely to return to state one than state two to capture the cyclical 
behaviour and in the long term the economy follows an upward trend, hence we assign 
probabilities 0.7 and 0.3. 

GM Initialisation (B ) 

The GM distribution fitting strongly affects the GM-BW algorithm optimisation, hence 
it must be satisfactorily initialised for GM-BW to provide acceptable results. However 
it is well known that GM distribution fitting in general (without any regime switching) 
is a non-trivial problem: 

• there are a large set of parameters to estimate. 

• there exists the issue of uniqueness, that is for a given non-parametric distribution 
there does not always exist a unique set of GM parameter values. 

• the flexibility of GM distributions to model virtually any unimodal or bimodal 
distribution means that it incorporates rather than rejects any noise in the data 
into the distribution. Therefore GM fitting is highly sensitive to noise. 

• parameter estimation is further complicated with regime switching and the fact 
we cannot identify with certainty the (hidden) state associated with each obser- 
vation. 

Rather than randomly initialise the GM parameters (as is done in Murphy's program) 
we approximately divided data into each regime by classifying positive returns as be- 
longing to state one, otherwise they belong to state two. We then fit a GM for each 
"regime's" data using a GM fitting program used by Lund University (stixbox). 

4-2. Results and Discussion 

We present the results of the Baum- Welch GM-HMM calibration to S&P 500 data 
from 1976-1996: 
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Table 1: Initial State Probabilities (7^) 



State (i) 


Probability 


1 
2 


1 x KT 6 
1 - 1 x 10" 6 



_ /0.78 0.22 
A 

"0.82 0.18, 



Table 2: Mixture Means ujk (%/year) 



Gaussian Component (k) 


State (j) 


1 


2 


M 


13.0 


-4.8 




28.0 


1.4 


Overall 


14.8 


-4.7 



Table 3: Mixture Standard Deviations ^pjk (%/year) 



Gaussian Component (k) 


State (j) 


1 


2 


M 


4.5 


5.6 




28.0 


110.0 


Overall 


11.6 


13.3 
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Table 4: Weighting Matrix (cjk) 



Gaussian Component (k) 


State (j) 


1 


2 


M 
M 2 


0.88 
0.12 


0.99 
0.01 



From table [21 we can infer that the BW algorithm has attributed state two as the 
down state since its over all mean is negative, unlike state one. Using the Markowitz's 
variance measure of risk Mar52j ] it is encouraging that we can conclude that the risk 
level in state two is higher than in state one (see table [3]), since a declining economy 
(state two) is a riskier economic state. Additionally, an increase in variance (and 
therefore volatility) with lower returns is consistent with the leverage effect. The 
initial state probabilities ¥ strongly suggest the HMM starts in state two and this is 
consistent with the data as the first year's return (see table [5]) is relatively low (1.2%). 

The transition matrix A is similar to the transition matrix theoretically postulated 
(for an economy); thus it is consistent with our theoretical expectations. The matrix 
A tells us that given the model is in state one it is likely to stay in that state most 
of the time, only transitioning to state two for 22% of the time. Additionally, in state 
two the model is most likely to return to state one (probability 0.78), thus captures 
the cyclical nature of the economy. 

To validate the quality of the GM-HMM calibration in terms of state sequence 
generation, we ran the Viterbi algorithm. Note that calibration under the Hamilton 
filter does not provide or enable state sequence estimation. The Viterbi algorithm gave 
the following state sequence result for in sample data 1976-96 in table EJ 
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Table 5: In Sample Regime Sequence Results 



Year 


Regime 


Empirical Annual Return (%) 


1976 


2 


1.2 


1977 


2 


-13.4 


1978 


1 


11.3 


1979 


1 


13.4 


1980 


1 


12.6 


1981 


2 


-7.3 


1982 


1 


18.8 


1983 


1 


11.7 


1984 


1 


9.5 


1985 


1 


16.5 


1986 


1 


25.8 


1987 


2 


-6.5 


1988 


1 


14.6 


1989 


1 


10.1 


1990 


1 


4.4 


1991 


1 


17.3 


1992 


1 


7.1 


1993 


1 


9.3 


1994 


2 


-2.4 


1995 


1 


30.2 


1996 


1 


21.2 



The regime sequence concurs with the empirical observations; negative or low re- 
turns were categorised into state two, whereas other returns were classed as state one. 
The results also illustrate the behaviour of the transition matrix A; generally remain- 
ing in state one (up state) whilst occasionally entering state two, in which case it 
quickly reverts back to state one. Hence the GM-HMM is able to capture the key 
characteristics of the economy, namely an upward trend and cyclicity. 

The Viterbi algorithm was also applied to out of sample data from 1997-2007. Again 
the GM-HMM was able to satisfactorily classify the years for each state, as shown in 
table El 
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Table 6: Out of Sample Regime Sequence Results 



Year 


Regime 


Empirical Annual Return (%) 


1997 


1 


22.1 


1998 


1 


26.6 


1999 


1 


8.6 


2000 


2 


-2.1 


2001 


2 


-18.9 


2002 


1 


-27.8 


2003 


1 


27.9 


2004 


1 


4.3 


2005 


1 


8.0 


2006 


1 


11.6 


2007 


2 


-2.6 



5. Conclusions 

This paper has shown the advantages of Baum- Welch calibration over the standard 
Hamilton filter method. Not only does the Baum- Welch method offer a complete 
calibration procedure but also is able to estimate the full set of HMM parameters, 
unlike the Hamilton filter. We have also validated the usage of the Baum- Welch method 
through numerical experiments on S&P 500 data in and out of sample. 
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